Abstract

Background: Novel dose-finding designs, using estimation to assign the best estimated
maximum-tolerated-dose (MTD) at each point in the experiment, most commonly via Bayesian techniques, have
recently entered large-scale implementation in Phase I cancer clinical trials and similar studies.

Purpose: To examine the small-sample behavior of these "Bayesian Phase I" (BP1) designs, and
also of non-Bayesian designs sharing the same main "long-memory" traits. We refer to this family of
designs as LMP1 ("long-memory Phase I").

Methods: Data from several recently published BP1 experiments are presented and discussed, and
LMP1's operating principles are explained. A simulation study compares the small-sample behavior of
long-memory and short-memory designs, on measures that are seldom examined, in particular run-to-run
variability.

Results: For all LMP1s examined, the number of cohorts treated at the true MTD (denoted here
as n*) was highly variable between numerical runs drawn from the same toxicity-threshold distribution,
especially when compared with "up-and-down" (U&D) short-memory designs. Further investigation using
the same set of thresholds in permuted order produced a nearly identical magnitude of variability in
n*. Therefore, this LMP1 behavior is driven by a strong sensitivity to the order in which toxicity
thresholds appear in the experiment. We suggest that the sensitivity is related to LMP1's tendency to
"settle" early on a specific dose level, a tendency known in the literature and seen in two of the presented
experiments. The "settling" tendency is caused by the repeated likelihood-based "winner-takes-all" dose
assignment rule, which grants the early cohorts a disproportionately large influence upon experimental
trajectories. A secondary point highlighted by our study is specific to the Bayesian designs: for BP1s,
the interplay between model form, prior distribution, and the need to produce plausible early-cohort
behavior generates a set of constraints and dependencies that is hard to control, and in certain ways
contradicts the rationale of Bayesian methodology.

Limitations: While the numerical evidence for LMP1's high run-to-run variability is broad, and
sensible explanations for it are provided, we do not present a theoretical proof of the phenomenon.

Conclusions: Method developers, analysts and practitioners should be aware of LMP1's variability
and order-sensitivity, and of the factors driving them. Presently, U&D designs offer a simpler and
more stable alternative, with roughly equivalent MTD estimation performance. A promising direction for
combining the two approaches is briefly discussed (note: the '3+3' protocol is not a U&D design).

Keywords: Bayesian Sequential Designs; Phase I Cancer Clinical Trials; Continual Reassessment Method;
Escalation with Overdose Control; Cumulative Cohort Design; Up-and-Down; Robustness

1 Introduction



Over the past two decades, numerous novel dose-finding designs employing Bayesian calculations, such as
the continual reassessment method (O'Quigley et al., 1990) and escalation with overdose control (Babb et al.,
1998), have been developed for Phase I cancer trials. The hallmark of these designs is estimation of the
dose-toxicity function after each cohort, in order to assign the estimated Maximum Tolerated Dose (MTD) to
the next cohort. These "Bayesian Phase I" designs (in short: BP1s) have been joined by novel non-Bayesian
designs using this principle (Leung and Wang, 2001; Yuan and Chappell, 2004; Ivanova et al., 2007). We will
use the acronym "LMP1" (long-memory Phase I) to refer to the family of designs assigning the estimated
MTD at each cohort, regardless of whether they employ Bayesian methods. Despite their popularity among
statisticians, LMP1s had struggled to enter actual practice, where the conservative '3+3' experimental
protocol (Carter, 1973), which has been repeatedly shown to possess poor properties with respect to selection
of the MTD for use in Phase II (Storer, 1989; Reiner et al., 1999; Lin and Shih, 2001), still dominates
(Rogatko et al., 2007). Ivy et al. (2010), on behalf of the Clinical Trial Design Task Force of the NCI's
Investigational Drug Steering Committee, embrace the new designs, suggesting that "...members of the boards
may not be convinced that novel designs are better for patients. In fact, they are." Even as clinicians turn
from skepticism to optimism, the task of constructing a comprehensive picture of LMP1 properties in theory
and practice is far from complete. Despite the relatively small number of published BP1 studies, some of
these have reported disturbing small-sample behavior, prompting the analysts to develop ad-hoc design
modifications that might mitigate it (Neuenschwander et al., 2008; Resche-Rigon et al., 2008). The available
theoretical results on LMP1s are partial, and mostly involve asymptotic behavior. Azriel et al. (2011) proved
that no LMP1 design can guarantee almost-sure convergence to the MTD on the class of all dose-toxicity
functions. Lee and Cheung (2009) provide a design tool that automatically produces a one-parameter
family of curves for CRM, upon the specification of an "indifference interval" around the target toxicity
rate. Azriel (manuscript in press) proved that the conditions used by this tool indeed guarantee convergence
to the specified interval - a weaker result than converging to the MTD itself, but practically encouraging.
Oron et al. (2011) show that a novel nonparametric "interval design" (Yuan and Chappell, 2004; Ivanova et al.,
2007) is [...] establishes a close asymptotic equivalence between two very different LMP1s. In Oron et al.'s
numerical examination of the convergence of one-parameter CRM and the interval design under a random sample
of dose-toxicity curves, for both designs the majority of scenarios did not meet the requisite conditions for
convergence to the MTD itself. Hence, it appears that with LMP1s one must settle at best for the interval
guarantee, rather than expect convergence to the MTD (Oron et al., 2011).



Much less is definitively known regarding small-sample behavior. Two interesting numerical studies
(O'Quigley et al., 2002; Paoletti et al., 2004) compared the success rate of CRM designs in selecting a dose
within a toxicity indifference-interval, to an "optimal" hypothetical experiment in which the location of each
subject's toxicity-threshold with respect to the dose space is exactly known. One-parameter CRM performed on
average very closely to the hypothetical complete-information experiment, on a class of randomly generated
scenarios. Our own simulation experience (a subset of which is presented in Section 4) suggests that several
other designs can produce average performance roughly on par with one-parameter CRM.

Numerical Phase I studies have focused almost exclusively upon ensemble-average performance. While
average values are important, in practice one does not run an ensemble - but rather a single experiment.
A case in point is the number of patients treated at the MTD, a statistic we shall refer to as n*. The
high ensemble-average values of n* when using LMP1s have been repeatedly invoked as a decisive reason for
preferring this design family (Rogatko et al., 2007; Zohar et al., 2012). Iasonos et al. (2008),
perhaps the only study to date to present a measure of LMP1 variability, report large standard deviations
along with these high averages (ref. Iasonos et al. (2008), Table 2). Our numerical studies (Section 4)
describe the complete distribution of n* under various scenarios and designs. LMP1s suffer from alarmingly
high run-to-run n* variability.

The between-run variability is related to LMP1's overarching feature, namely the insistence upon treating
every cohort with what is estimated to be the best possible dose at any given time. The considerable
operational complexity of model-based LMP1 designs, especially BP1s, often exacerbates matters.

The article is organized as follows: Section 2 defines terminology and describes LMP1's operating
principles. Section 3 presents detailed examples from published BP1 experiments. Section 4 numerically
compares CRM, an "interval design" and a short-memory "up-and-down" design. A general discussion ends
the article.

2 Preliminaries 
2.1 Basic Terminology 

We restrict the discussion to trials carried out as sequential dose-finding experiments with $n$ cohorts,
indexed $c$, $c = 1, \ldots, n$, each cohort comprising $k_c \geq 1$ subjects. Except for cohort 1, the dose
administered to cohort $c$ is (generally speaking) not known until all observations up to cohort $c - 1$ are
available. $Y_c$, the number of dose-limiting toxicities (DLTs) observed in cohort $c$, can be modeled as a
Binomial random variable:

$$Y_c \sim \mathrm{Binomial}\left(k_c, F(x_c)\right), \tag{1}$$

where $x_c$ is the dose administered to cohort $c$, and $F$ is the true (and unknown) underlying toxicity
function, assumed to be a continuous strictly increasing CDF of the response-triggering dose variable $x$. In
these terms, the experiment's goal is to find $Q_p$ - the $100p$-th percentile of $F$. This dose is known as
the experiment's target. In Phase I experiments, $p$ is usually between 1/5 and 1/3. Doses themselves are
restricted to a finite set of levels $\mathcal{D} = \{d_1, \ldots, d_l\}$, with $l$ usually between 4 and 10.
The dose level closest to $\hat{Q}_p$, the final estimate of target, will typically be recommended as the MTD
for Phase II.

A generic BP1 design can be described as one where in order to decide which dose to allocate to the next
cohort, all hitherto available observations are used to estimate $F$ via the model

$$R_u \sim \mathrm{Binomial}\left(n_u, G(d_u, \theta)\right), \quad u = 1, \ldots, l, \tag{2}$$

where $n_u$ is the number of available observations at dose $d_u$, $R_u$ is the number of those among the
$n_u$ who exhibit toxicities, and $G$, the model curve, is a cumulative distribution function (CDF) belonging
to a parametric family $\mathcal{G}$ indexed by a parameter vector $\theta$ (which usually has a prior
distribution with additional, fixed parameters). According to Rogatko et al. (2007), most BP1 experiments
published through 2006 had used a one-parameter CRM model, most often of the generic form¹

$$G(d_u) = \phi_u^{\theta}, \quad \phi_1 < \phi_2 < \ldots < \phi_l, \quad \phi_u \in (0, 1)\ \forall u, \quad \theta > 0. \tag{3}$$

The $\phi_u$, a sequence of constants supplied by the user, are known as the model's "skeleton." Note that
this model form means that even though there is only one data-estimable parameter $\theta$, there are $l$ fixed
parameters defining the "skeleton", as well as additional fixed parameters involved in $\theta$'s prior
distribution.

After cohort $c$, BP1s assign the next dose via a Bayesian posterior estimation of $G$ at the dose levels.
The level whose estimate is closest to target is chosen next. The most common criterion is choosing the dose
that minimizes $|\hat{G}(d_u) - p|$.

Input data to the model can be summarized as the observed sample proportions

$$\hat{F}_u = \frac{R_u}{n_u}, \quad u : n_u > 0, \tag{4}$$

which are the sufficient statistics for a nonparametric model of $F$.

BP1 designs are sometimes called "designs with memory" (O'Quigley and Zohar, 2006); a more precise
description would be long-memory designs. This is because allocation decisions are affected by any
observation involved in the estimation step, regardless of how far back in the experiment it was collected.
The contrast is obviously with short-memory designs, ones that only use relatively recent observations (often,
only the last cohort) to make decisions.
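
The BP1 allocation step described above can be sketched in a few lines of code. The following is a minimal illustration rather than the code used in this article: it assumes the one-parameter power model of equation (3), a log-Normal prior on the parameter approximated on a grid, and the closest-to-target criterion. The function name `crm_next_dose`, the prior standard deviation and the grid are hypothetical choices, and no escalation constraint is applied.

```python
import math

def crm_next_dose(skeleton, n_obs, n_tox, target, prior_sd=1.0, grid_size=2001):
    """Return the 0-based index of the dose whose posterior-mean toxicity
    estimate is closest to `target`, under G(d_u) = skeleton[u] ** theta.

    Assumptions (not from the source): log(theta) ~ Normal(0, prior_sd^2),
    and the posterior is approximated on an even grid of log(theta) values.
    """
    lo, hi = -4.0 * prior_sd, 4.0 * prior_sd
    step = (hi - lo) / (grid_size - 1)
    log_weights, thetas = [], []
    for i in range(grid_size):
        log_theta = lo + i * step
        theta = math.exp(log_theta)
        # Log prior density of log(theta), up to an additive constant.
        log_w = -0.5 * (log_theta / prior_sd) ** 2
        # Binomial log-likelihood of equation (2), up to an additive constant.
        for phi, n, r in zip(skeleton, n_obs, n_tox):
            g = phi ** theta
            if r:
                log_w += r * math.log(g)
            if n - r:
                log_w += (n - r) * math.log1p(-g)
        log_weights.append(log_w)
        thetas.append(theta)
    top = max(log_weights)  # subtract the maximum for numerical stability
    weights = [math.exp(w - top) for w in log_weights]
    total = sum(weights)
    # Posterior mean of the model curve G at each dose level.
    g_hat = [sum(w * phi ** t for w, t in zip(weights, thetas)) / total
             for phi in skeleton]
    return min(range(len(skeleton)), key=lambda u: abs(g_hat[u] - target))
```

Because every observation enters the likelihood with equal standing, a cohort observed early in the experiment influences this recommendation exactly as much as the most recent one, which is the "long-memory" property discussed above.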

Equation (2) is also applicable to frequentist long-memory designs, and even to nonparametric ones. The
latter directly use the $\hat{F}_u$, which can be viewed as a special case of $G$. Therefore, we treat any
design that allocates successive cohorts via estimation of a model of the general form (2) as belonging to
the long-memory family, regardless of whether it employs nonparametric, parametric or Bayesian methods. For
this family we will use the acronym LMP1, with BP1s forming a subfamily within it.

¹ A more sophisticated one-parameter model by Chevret (Chevret, 1993) is described in Supplement A.

[Figure 1 plot: "Flinn et al. (2000) at Experiment End, n=20"; horizontal axis: Dose (mg/sq.m./wk).]

Figure 1: The Flinn et al. (Flinn et al., 2000) experiment, targeting 20% toxicity (horizontal dashed line).
Shown are observed toxicity frequencies ('X' marks) and the posterior model curve (connected 'G' marks)
at the experiment's end. 'X' mark area is proportional to sample size at each dose.



2.2 LMP1's Operating Principle and Basic Limitations

The LMP1 allocation process is akin to fitting a regression curve, weighted by the number of observations and
constrained by the model family $\mathcal{G}$, through the points $\{(d_u, \hat{F}_u)\}$. These points are the
'X's in Figure 1, displaying data at the end of a published CRM experiment (Flinn et al., 2000). The
regression is fit by a weighted combination of the prior and the likelihood. The experiment's goal is finding
the dose closest to the place where the true $F$ crosses the horizontal $y = p$ dashed line in Figure 1. LMP1s
allocate each cohort to the best current candidate dose, according to the fitted $G$ curve. If too many
toxicities are observed at that dose, the corresponding 'X' mark will move higher, pulling $G$ with it and
eventually mandating dose reduction; and vice versa. This is the basic self-correction mechanism.
Furthermore, the intuition that the sample proportions will eventually converge to their true values has been
recently proven for generic sequential dose-finding designs (Oron et al., 2011).



These two elements - self-correction in the assumed direction of target, and consistency of observed
toxicity rates - form the "engine" driving LMP1s. The requirements from the model are so modest, that a model
$G$ for $F$ is not even needed in order to construct the "engine". For example, interval designs (Yuan and
Chappell, 2004; Ivanova et al., 2007) have no model. Instead, they mandate dose escalation if $\hat{F}$ at the
current dose is below some "tolerance interval" around $p$, and vice versa.
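
The model-free rule just described is simple enough to state in code. The sketch below is illustrative only: the function name is hypothetical, and the handling of rates exactly on the interval's edge and of the lowest and highest dose levels is an assumption, not taken from the cited designs.

```python
def interval_design_next_level(level, n_at_level, tox_at_level, interval, n_levels):
    """Return the next 0-based dose level under a simple interval rule.

    `n_at_level` and `tox_at_level` are the cumulative number of subjects
    and toxicities observed so far at the current level.
    """
    low, high = interval
    rate = tox_at_level / n_at_level  # observed proportion at the current dose
    if rate < low and level < n_levels - 1:
        return level + 1  # too few toxicities: escalate one level
    if rate > high and level > 0:
        return level - 1  # too many toxicities: de-escalate one level
    return level          # inside the interval, or at a boundary: repeat
```

Note that the rule consults the cumulative proportion at the current dose, so it retains the long-memory property even though no curve is fitted.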

These operating principles also dictate the relationship between $G$'s slope and experimental trajectories.
Shallow model curves will shift the crossing point more dramatically as $G$ changes. Hence, they are associated
with more volatile dose allocations, and vice versa for steep curves. Convex $\mathcal{G}$ skeletons, shallow
to the left and steep to the right, are rather popular in practice. They are quick to descend but more
conservative when escalating. Generally, multi-parameter models can adapt the fitted slope to the observations.

It is important to note that this $\mathcal{G}$-determined degree of volatility is unrelated to the actual
rate of convergence to the MTD. The latter is paced by the convergence rate of $\hat{F}$, i.e., root-n. This is
a very slow rate compared with typical Phase I sample sizes of 10-40 patients. If $\mathcal{G}$ correctly
specifies $F$, then all data are pooled to consistently estimate $\theta$, providing the fastest possible
convergence within the root-n constraints. In the more likely case of misspecification, this pooling affords
little help. Convergence to the MTD, then, is at best constrained by the convergence of individual
$\hat{F}$'s around target. Rather often, such convergence is not guaranteed at all. As mentioned in the
Introduction, one has to settle for convergence to an "indifference interval" which might contain several
levels.
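
To put the root-n rate in perspective, consider the standard error of a single observed toxicity proportion. As a simplifying and optimistic assumption (ours, for illustration only), suppose all $n$ patients are treated at one dose whose true toxicity rate is $F = 0.3$:

```latex
\mathrm{SE}(\hat{F}) = \sqrt{\frac{F(1-F)}{n}}, \qquad
n = 10:\ \sqrt{\frac{0.21}{10}} \approx 0.145, \qquad
n = 40:\ \sqrt{\frac{0.21}{40}} \approx 0.072 .
```

Quadrupling the sample size only halves the standard error, so even with every observation concentrated at a single level the estimate remains noisy at these sample sizes.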



3 Experimental Examples 

We present in this section four published BP1 experiments. Each experiment is accompanied by a figure, in
which the left-hand frame describes the experiment's trajectory - i.e., each cohort's administered dose levels
and the number of toxic and non-toxic responses observed for each, arranged in chronological order - and
the right-hand frame presents the evolution of posterior model curves. For brevity's sake, model details are
relegated to Supplement A; the trajectories of two additional experiments appear in Supplement B.

[Figure 2 panels: "Dougherty et al. (2000) Experiment Trajectory" (left); "Dougherty et al. (2000) Successive
Toxicity Curve Models" (right).]

Figure 2: Experimental trajectory (left) and dose-response curves (right) from the Dougherty et al. experiment.
In the left frame, subjects are shown in their chronological order plotted against the administered levels;
each empty circle represents a single negative (no-pain) response, and each filled circle represents a positive
response. In the right frame, final empirical pain-rates ($\hat{F}$) are shown in 'X' marks, whose size is
proportional to the number of observations. The piecewise-linear curves represent posterior predictive toxicity
estimates, with the number indicating the last subject before the update. The zero-symbol curve is the prior,
and the symbols A, B and C stand for estimates after the 16th, 18th and 25th subject, respectively. The dashed
horizontal line indicates the target response rate, in this case 0.2.



3.1 Dougherty et al.'s Anesthesiology Experiment (Dougherty et al., 2000)



This study (Figure 2) was not, strictly speaking, a Phase I trial, but rather a CRM design applied to an
anesthesiology experiment (Dougherty et al., 2000). Instead of toxicity, a positive response indicates pain.
The target pain rate was 0.2, and there were 25 patients treated one at a time. Chevret's (Chevret, 1993)
one-parameter logistic model was used. There were 4 levels in this design, with "skeleton" pain probabilities
set at $\phi = (0.1, 0.2, 0.4, 0.8)$. The Goodman et al. (Goodman et al., 1995) constraint, forbidding
escalation by more than one level between successive cohorts, was in effect. According to its bottom line, the
experiment was an astounding success: 18 of 25 patients were treated at the recommended dose ($d_2$), with a
cumulative pain rate of 3 out of 18 - almost as close to target as possible (4 of 18 would have been slightly
closer).

3.2 Pisters et al. (Pisters et al., 2004) and Mathew et al. (Mathew et al., 2004)



A pair of experiments conducted at the M.D. Anderson Center and published in 2004 targeted $p = 0.3$, using
a one-parameter 'power' model CRM (Pisters et al., 2004; Mathew et al., 2004). The former followed the
single-level increment constraint (Goodman et al., 1995), and had 4 dose levels with prior toxicity
probabilities nearly identical to Dougherty et al.'s: $\phi = (0.05, 0.20, 0.40, 0.80)$ (Figure 3, top). After
an (unplanned) single patient at level 1 and the first 3-patient cohort at level 2 (with no DLTs observed in
either), all 22 remaining patients (8 cohorts) were assigned the third level. The observed DLT rate at that
level (7/22) was the closest possible to target with 22 observations; not surprisingly this was the
recommended MTD.

The story was different for the second experiment (Mathew et al., 2004), which neglected to follow the
single-escalation constraint. The design called for six-person cohorts, and had six levels with a relatively
shallow "skeleton" $\phi = (0.07, 0.16, 0.30, 0.40, 0.46, 0.53)$, beginning at $d_3$ (Figure 3, bottom). After
zero toxicities observed on the first cohort, allocation jumped directly to a level above $d_4$, where 3 out
of 4 toxicities forced the experimenters to cut the cohort short and de-escalate to $d_4$. At that level, 5
toxicities out of 6 were observed, so the experiment descended back to $d_3$, where now 3 of 6 experienced
DLTs. This dose, with a cumulative toxicity rate of 0.25, was recommended as the MTD; but not before half the
patients in the study (11 of 22) experienced DLTs. More disturbingly, a recalculation of $\hat{G}$ according
to the model indicates that the final MTD estimate should have been $d_2$, with a posterior $\hat{G} = 0.28$
compared with 0.43 for $d_3$. This level had never been assigned during the experiment. Moreover, $d_2$,
rather than $d_3$, should have been assigned to the last cohort as well ($\hat{G} = 0.25$ and 0.40,
respectively).²



3.3 Neuenschwander et al. (Neuenschwander et al., 2008)



This experiment began as a one-parameter 'power' CRM, with a large number of levels, $l = 15$ (Figure 4).
The starting dose $x_1$ was $d_1$, and the single-level escalation restriction was initially in effect. The
predictive prior placed the MTD at $d_{10}$, creating an immediate tension between posterior recommendations
and dose-escalation restrictions. After 4 cohorts with 16 patients, cumulatively, yielded no toxicities, the
posterior MTD was $d_{12}$ and researchers agreed to skip from $d_{[\ldots]}$ to $d_{[\ldots]}$. The next two
patients both experienced DLTs, but CRM still recommended jumping from $d_{[\ldots]}$ to $d_{[\ldots]}$ rather
than de-escalating. At this point the trial was put on hold, and intensive simulation and theory work
comprising the bulk of the article's body was performed (Neuenschwander et al., 2008). The authors ultimately
replaced the one-parameter model by a two-parameter logistic, and modified the decision rule to penalize
toxicity more heavily. These changes resulted in $d_{[\ldots]}$ (i.e., a one-level de-escalation) being
recommended for the trial's continuation. All 3 remaining cohorts were administered that dose, which
eventually became the recommended MTD with 2 toxicities observed on 9 patients.

² We inquired with the consulting statistician to this study, and he could not recall the circumstances
surrounding the decisions to overrule $d_2$ with $d_3$.

Figure 3: Descriptions of the Pisters et al. (top) and Mathew et al. (bottom) experiments, using a convention
similar to that of Figure 2. The curve with the "A" symbols in the top right frame indicates the final
posterior after cohort 10.

[Figure 4 panels: "Neuenschwander et al. (2008) Experiment Trajectory" (left; horizontal axis: Cohort);
"Neuenschwander et al. (2008) Successive Toxicity Curve Models" (right; horizontal axis: Dose (mg/sq.m./wk)).]

Figure 4: Trajectory (left) and posterior model curves (right) of the Neuenschwander et al. experiment.
The dashed line after cohort 5 in the left-hand frame indicates the original allocation to cohort 6 using the
one-parameter model. At this point both model and loss functions were replaced.



4 Numerical Demonstrations 



4.1 Overview and Methods 

In this section, we numerically examine some aspects of LMP1 behavior, compared with short-memory
designs taken from the 'Up-and-Down' (U&D) family (Dixon and Mood, 1948). U&D is often conflated
by authors in the field (e.g., by Rogatko et al., 2007; Neuenschwander et al., 2008) with the 3+3 protocol.
However, the two diverge in several important respects - first and foremost, the fact that U&D is an
experimental design while 3+3, strictly speaking, is not. In the same vein, U&D designs possess tractable
theoretical properties that are sorely lacking for 3+3 (see Supplement C for a more detailed list of
differences).

U&D designs generate random walks over the dose space, with visit frequencies peaking near $Q_p$ (Derman,
1957; Tsutakawa, 1967). There has been considerable recent methodological work on U&D, exploring their
properties as nonparametric designs and developing novel variations and extensions (Durham and Flournoy,
1995; Gezmu, 1996; Ivanova et al., 2003; Oron and Hoff, 2009). Post-experiment estimation can be done
in various methods. Nowadays, isotonic regression is often recommended as a robust and relatively efficient
choice (Stylianou and Flournoy, 2002). For the runs illustrated here, we used a group U&D design
(Tsutakawa, 1967; Gezmu and Flournoy, 2006) with a cohort size of 2. The allocation rule is to escalate if
no toxicities are observed, and otherwise de-escalate. This design converges to a visit distribution peaked
near $Q_{0.29}$.
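
The location of that peak follows from a balance argument: with cohort size k and escalation only after a toxicity-free cohort, upward and downward moves are equally likely at the dose where (1 - F(x))^k = 1/2. A short sketch of this calculation and of the allocation rule follows; the function names are illustrative, and the behavior at the lowest and highest levels (staying put) is an assumption.

```python
def gud_balance_point(cohort_size):
    """Toxicity rate at which a group U&D design that escalates only after
    zero toxicities in a cohort is equally likely to move up or down."""
    # P(escalate) = (1 - p) ** k; set this equal to 1/2 and solve for p.
    return 1.0 - 0.5 ** (1.0 / cohort_size)

def gud_next_level(level, n_tox, n_levels):
    """Group U&D rule from the text: escalate after a toxicity-free cohort,
    otherwise de-escalate. Levels are 0-based and clipped at the boundaries."""
    if n_tox == 0:
        return min(level + 1, n_levels - 1)
    return max(level - 1, 0)
```

For a cohort size of 2 the balance point is 1 - sqrt(1/2), or about 0.293, which is why this design suits a target near the 30th percentile.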

Simulation details: 

LMP1 and U&D performance simulations were carried out by generating a pseudorandom ensemble of $M$ runs
with $n$ toxicity thresholds each, all drawn from the same distribution. The results shown here are from a
simulation setup with $M = 1000$, $n = 32$, $l = 6$, a cohort size of 2 for all designs, and the target at
$Q_{0.3}$ (the 30th percentile). We show results from six distributions (hereafter called "scenarios"),
calibrated so that each of the 6 dose levels is the true MTD for one scenario. All runs started at $d_2$. The
code and subsequent analysis were implemented in R (R Development Core Team, 2011).

For CRM the presented results used the "power" model with a "skeleton" similar to that used by Flinn et al.
(Fig. 1). Their skeleton was $\phi = (0.05, 0.10, 0.20, 0.30, 0.50, 0.65, 0.80)$ with $l = 7$, and ours is
$\phi = (0.05, 0.11, 0.22, 0.40, 0.60, 0.78)$ with $l = 6$. The prior on $\theta$ was log-Normal, the one most
commonly used in practice, and was calibrated so that initial responses be "coherent", i.e. a no-toxicity
cohort will trigger an escalation and vice versa (Cheung, 2005). The single-level escalation constraint was
universally used, in both the upward and downward directions. Further details (simulation scenario curves,
etc.) appear in Supplement D.
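
The threshold-based setup described above can be sketched as follows. This is a minimal Python illustration, not the R code used for the article: a subject is assumed to show toxicity when the administered dose is at or above that subject's threshold, and all function names and the `next_level(level, n_tox)` interface are hypothetical. That interface only suits short-memory rules; a long-memory rule would need the full history of doses and outcomes instead.

```python
def simulate_run(thresholds, doses, next_level, start_level, cohort_size=2):
    """Run one experiment over a fixed sequence of toxicity thresholds.

    Returns the list of 0-based dose levels given to successive cohorts.
    """
    level, path = start_level, []
    for i in range(0, len(thresholds), cohort_size):
        cohort = thresholds[i:i + cohort_size]
        # Toxicity occurs when the dose reaches the subject's threshold.
        n_tox = sum(doses[level] >= t for t in cohort)
        path.append(level)
        level = next_level(level, n_tox)
    return path

def n_star(path, mtd_level, skip_first=True):
    """Number of cohorts treated at the true MTD, optionally excluding the
    first cohort, whose dose is fixed in advance."""
    cohorts = path[1:] if skip_first else path
    return sum(lvl == mtd_level for lvl in cohorts)

def simulate_ensemble(draw_threshold, n_runs, n_subjects, **run_args):
    """Generate `n_runs` runs, each with freshly drawn thresholds."""
    return [
        simulate_run([draw_threshold() for _ in range(n_subjects)], **run_args)
        for _ in range(n_runs)
    ]
```

Keeping the thresholds as an explicit list, rather than drawing toxicities on the fly, is what later allows the same set to be replayed in a different order.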



4.2 Between-Run Variability and the Order Effect

Dose-finding simulation summaries are usually statistics of average ensemble performance, for example the
proportion of runs for which the true MTD was found by various designs, or n*/n - the overall average
fraction of simulated doses allocated to the true MTD. Many LMP1 designs tend to perform well on these
summaries, especially the latter, which also happens to be one of U&D's weakest aspects, being a random-walk
design that inevitably spreads allocations over several levels.

Rather than just report the average, Figure 5 displays the distribution of n* (excluding the first cohort)
over the simulated ensemble. This gives us a glimpse into run-to-run variability. The ensemble average n* is
visible as the bold vertical line in the middle of each histogram. One-parameter CRM (left) is compared here
with group U&D (Tsutakawa, 1967; Gezmu and Flournoy, 2006) (right). Shown are three of the scenarios
(top to bottom).

[Figure 5 panels: CRM (left column) and U&D (right column) under the Normal, Gamma and Lognormal scenarios;
horizontal axis in each panel: Number of Cohorts Allocated to MTD.]

Figure 5: Between-run and between-scenario variability. The histograms depict the ensemble distribution
of n*, excluding the first cohort. The ensemble size is 1000 runs. Scenarios are Normal (top), Gamma
(middle) and Lognormal (bottom); designs are CRM one-parameter 'power' (left) and GU&D (right), both
with cohort size 2. The runs were 16 cohorts long, starting at $d_2$.

The most dramatic feature in Fig. 5 is CRM's between-run variability. Even under the Normal scenario
(top left), where the ensemble mode is at a spectacular 11 MTD-allocated cohorts out of 15 and the average
is around 8.5 cohorts, 14% of the runs ended with n* < 2. Under the Gamma scenario (middle left), CRM's
modal n* outcome allocates zero cohorts to the MTD during the experiment. It should be noted that for
the Gamma scenario, the MTD was actually the starting dose ($d_2$), meaning that in one-fifth of the runs
CRM immediately veered away from its starting dose, never to return - despite $d_2$ being the correct MTD.
Finally, the log-Normal scenario (bottom left) generates strongly divergent behavior, with very low or very
high values of n* more common than intermediate outcomes.

With U&D (Fig. 5, right-hand frames), between-run and between-scenario differences are far smaller.
Due to its random-walk nature, group U&D cannot allocate more than roughly half the cohorts to any single
level except on the boundary. However, in all scenarios the modal outcome is reasonably close to this limit
at 5-6 cohorts per run, with the vast majority of runs producing n* values within ±2 of the mode.

To help pinpoint the reason for this variability between CRM runs, we replaced the randomly-generated
thresholds with fixed sets. For each distribution we started with a "perfect set" consisting of the percentiles
$Q_{1/33}, \ldots, Q_{32/33}$. Such a set would be unrealistically well-behaved. Therefore, we "knocked out" two
thresholds in the vicinity of $Q_p$, one on each side, and replaced them with replicas of $Q_{1/33}$ and
$Q_{32/33}$, respectively. This makes the behavior more realistic, while leaving the observable target in its
true location. We then generated 1000 runs, using the exact same set of thresholds and permuting only the order
in which they appear. Figure 6 shows the distributions of n* from these runs; it is impressively similar to
Fig. 5. This establishes that CRM's run-to-run variability in n* is driven primarily by variations in sampling
order.
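
The construction of the fixed threshold set and its permutations can be sketched as below. This is an illustration under stated assumptions, not the article's code: the text says only that the knocked-out thresholds are "in the vicinity" of the target, and here they are taken to be the two percentiles immediately adjacent to p. The function names and the seed are hypothetical; the standard Normal is used purely as an example distribution.

```python
import random
from statistics import NormalDist

def perturbed_percentile_set(inv_cdf, n=32, p=0.3):
    """Thresholds at percentiles 1/(n+1), ..., n/(n+1), with the percentile
    just below p replaced by a copy of the smallest threshold and the one
    just above p replaced by a copy of the largest."""
    probs = [i / (n + 1) for i in range(1, n + 1)]
    values = [inv_cdf(q) for q in probs]
    below = max(i for i, q in enumerate(probs) if q < p)
    above = min(i for i, q in enumerate(probs) if q > p)
    values[below] = values[0]    # replica of the lowest percentile
    values[above] = values[-1]   # replica of the highest percentile
    return values

def permuted_orders(values, n_runs, seed=0):
    """Return `n_runs` random orderings of the same threshold set."""
    rng = random.Random(seed)
    return [rng.sample(values, len(values)) for _ in range(n_runs)]

# Example: a 32-threshold set from the standard Normal distribution.
example_set = perturbed_percentile_set(NormalDist().inv_cdf)
```

Because one replaced value moves further below the target and the other further above it, the number of thresholds on each side of the target is unchanged, which is what keeps the observable target in its true location.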

Variability in n* and sensitivity to sampling order are properties of all LMP1 designs, not just Bayesian
ones. Figure 7 repeats the same exercise of Figs. 5 and 6 (pseudorandom draws, then permutations of an
idealized threshold set), using Ivanova et al.'s nonparametric "cumulative cohort design" (CCD) for dose
allocation (Ivanova et al., 2007). CCD is an interval design, perhaps the LMP1 design type most different from
CRM: it repeats the same dose $d_u$ as long as $\hat{F}_u$ falls inside a tolerance interval around $p$. In the
depicted runs we used the interval (0.2, 0.4), recommended in (Ivanova et al., 2007) for $l = 6$.
Between-scenario variability in n* is smaller than in CRM, but between-run variability, if anything, is even
greater.



4.3 Average Performance and Effect of Prior 

Table 1 presents the percent of runs in which the MTD was correctly selected for the three methods under
six scenarios. We chose scenarios where the MTD is unambiguous: its true $F$ value is always very close
to 0.3, and the $F$ values of neighboring levels are no closer than approximately 0.2 or 0.4 (see details in
Supplement D).

[Figure 6 panels: CRM (left column) and GU&D (right column) under the Normal, Gamma and Lognormal scenarios,
each titled "order permutations with 2 outliers"; horizontal axis in each panel: Number of Cohorts Allocated
to MTD.]

Figure 6: Similar to Fig. 5, except that rather than draws out of a simulated distribution, the runs are
permutations of the same set of 32 thresholds, as described in the text.

[Figure 7 panels: CCD with "Different Thresholds" (left column) and "Permutations Only" (right column) under
the Normal, Gamma and Lognormal scenarios; horizontal axis in each panel: Number of Cohorts Allocated to MTD.]

Figure 7: Distribution of n* using the exact random draws of Figs. 5 (left) and 6 (right), under the
nonparametric interval design CCD (Ivanova et al., 2007).

Table 1: Bulk performance comparison between "power" CRM, CCD and group U&D. For each of six
scenarios, compared are the proportion of runs (in percent) in which the correct MTD was selected, after 8
(left) and 16 (right) cohorts, respectively. CRM is estimated as the next dose allocation; U&D and CCD were
estimated using centered isotonic regression.

               MTD      After 8 Cohorts        After 16 Cohorts
Scenario       Level    CRM    CCD    U&D      CRM    CCD    U&D
"Uniform"      1        50.2   57.1   54.0     62.1   64.1   60.8
"Gamma"        2        36.6   44.2   40.8     47.4   53.2   51.2
"Normal"       3        57.8   54.2   56.4     67.5   67.1   63.0
"Lognormal"    4        46.7   34.0   33.0     59.3   46.2   48.4
"Weibull"      5        39.0   28.1   38.1     47.2   42.6   45.0
"Logistic"     6        26.0   30.0   32.2     29.3   48.5   54.6

Overall, the performance differences between these three very different designs are remarkably small: in
four of six scenarios, after 32 subjects the methods' success rates are within 6 percentage points of each other.
It is actually CRM that falls most conspicuously behind in the scenario targeting d6 (bottom row), in which it
shows nearly no improvement during the experiment's second half. One reason is that under this scenario, the
design fails to converge to the MTD, because the Cheung-Chappell conditions on the relationship between
F and G are not met (Oron et al., 2011).

Table 2 shows what happens to performance if we retain the same CRM "skeleton", but change prior
parameters. The prior used to produce Figures 5 and 6 and Table 1 is labeled "A". It represents a modest amount
of scientific knowledge and priorities: it assumes the middle of the dose range is somewhat more likely to
contain the MTD, and that the highest doses are less likely or desirable than the lowest ones (Table 2, left
columns). Prior B, which encourages dose escalation (e.g., d5 has more prior-predictive weight than d2 or
d3), is commonly recommended by CRM researchers as "uninformative" (it is the default prior in Cheung's
'crm' R function). Prior C reflects a strong belief that the MTD is in the lower half of the dose range, or
(equivalently) a reluctance to prefer higher doses until overwhelming evidence has accumulated. All priors
used the log-Normal distribution.

Under most scenarios, the performance variability when using the same CRM model with different priors
is as great as or greater than the variability between methods seen in Table 1. The relative performance in
Table 2 mirrors the MTD's relative prior-predictive weight under each prior, or more precisely: each level's
predictive weight compared with its immediate neighbors. The performance improvement from 16 to 32
subjects is around 10 to 15 percentage points in most scenarios regardless of prior; however, it is substantially
slower under the Weibull and Logistic scenarios - the two scenarios under which this model fails to converge to
the MTD.

Table 2: Similar to Table 1, but only with CRM, using the same "skeleton" and three different priors labeled
A, B and C. The first three columns show each prior's predictive MTD distribution.

                   Prior Weight             After 8 Cohorts         After 16 Cohorts
Scenario/MTD       A       B       C        A       B       C       A       B       C
"Uniform"/d1       0.25    0.26    0.33     50.2    53.3    50.2    62.1    63.1    64.5
"Gamma"/d2         0.14    0.10    0.22     36.6    34.4    44.3    47.4    45.9    54.0
"Normal"/d3        0.20    0.15    0.25     57.8    52.6    63.2    67.5    66.0    72.6
"Lognormal"/d4     0.22    0.18    0.16     46.7    46.6    45.0    59.3    55.4    56.4
"Weibull"/d5       0.14    0.17    0.04     39.0    39.9    23.0    47.2    50.6    29.7
"Logistic"/d6      0.05    0.15    0.002    26.0    34.6    0.0     29.3    41.0    7.2



4.4 "Settling" and Estimation Success 



The phenomenon of LMP1 experiments settling fairly early on a single dose is well-known: see, e.g., the first
two experiments in Section 3. O'Quigley (O'Quigley, 2006), Rogatko et al. (Rogatko et al., 2007), and many
others see it as a strength. The rationale is that such a settling indicates the LMP1 self-correction mechanism
needs little further information to determine the MTD as best it can. In this numerical demonstration, we
consider a run to have "settled" once the same dose has been assigned 5 consecutive times (excluding the
arbitrary starting dose). Some LMP1 studies had suggested a similar settling criterion as a stopping rule
(Zohar and Chevret, 2003). Figure 8 divides the bulk summaries of CRM performance in each scenario into
four groups, according to the time at which settling is first encountered - after 8 or fewer cohorts have been
observed; after 9-12 cohorts; after 13-16 cohorts; or not at all. Recall that the simulation had 16 cohorts of
size 2. Bar lengths are proportional to group sizes, and the shaded regions represent the runs pointing to
the correct MTD at the time of settling.
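The settling criterion above is simple enough to state as code. The following is a minimal sketch, not the simulation code used for the article: the function names are hypothetical, and skipping the very first allocation is our reading of "excluding the arbitrary starting dose".

```python
from typing import Optional, Sequence


def first_settling_cohort(doses: Sequence[int], run_length: int = 5) -> Optional[int]:
    """Return the 1-based cohort number at which `run_length` identical
    consecutive allocations are first completed, or None if this never happens.

    Assumption: the first allocation (the arbitrary starting dose) is ignored.
    """
    streak = 0
    for i in range(1, len(doses)):              # index 0 is the starting dose
        if i > 1 and doses[i] == doses[i - 1]:
            streak += 1
        else:
            streak = 1                          # a new run starts at this cohort
        if streak >= run_length:
            return i + 1                        # convert to 1-based cohort number
    return None


def settling_group(cohort: Optional[int]) -> str:
    """Map a settling time to the four groups used in Figure 8."""
    if cohort is None:
        return "not observed"
    if cohort <= 8:
        return "<9 cohorts"
    if cohort <= 12:
        return "9-12 cohorts"
    return "13-16 cohorts"


# Hypothetical allocation sequence: cohorts 3-7 all receive dose level 3.
example = [1, 2, 3, 3, 3, 3, 3, 4]
print(first_settling_cohort(example), settling_group(first_settling_cohort(example)))
```

Applied to every simulated run, a function of this kind yields the group sizes that the bar lengths in Figure 8 represent.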

While (as shown in Fig. 5 and Table 1) CRM performance varies strongly between scenarios, its settling
behavior is remarkably uniform: under all scenarios, roughly half the runs encounter five consecutive identical
allocations by cohort 8, and 80%-90% of runs display this phenomenon by cohort 12. There is no clear
association across scenarios between how early a run first settles, and whether it settles on the correct MTD.

[Figure 8 panels: Uniform, Gamma, Normal, Lognormal, Weibull, Logistic. Horizontal axis: "Five Straight Identical Allocations First Observed after:" with groups <9 cohorts, 9-12 cohorts, 13-16 cohorts, Not Observed.]

Figure 8: CRM estimation performance, by scenario and "settling". Bar length is proportional to the number
of runs in each "settling" stage. The shaded portion represents those runs "settling" on the true MTD at
that time.

5 Discussion and Recommendations



5.1 LMP1's Variability, Order Sensitivity and "Settling"

The order sensitivity of LMP1 designs exposed in Section 4.2, leading to strong between-run variations in
n*, is possibly reflected in the experimental examples of Section 3. When early observations are concordant
with the rest of the experiment and with the predictive prior, the strong positive reinforcement yields a high
n*. The first two experiments in Section 3 are ostensible examples (unlike numerical runs, here we do not
know the true F). Conversely, an unfortunate set of early responses can bring n* close to zero; Mathew et al.
(Mathew et al., 2004) might be such a case. In that experiment, data from cohorts 2-4 (excluding cohort 1)
strongly point towards d1 as the most likely MTD candidate. However, since the first cohort pointed in
the opposite direction (cIq), the experiment was spent largely in the upper part of the dose range, and the
model-estimated MTD (d2) was never assigned during the experiment itself.

Order sensitivity is an inevitable consequence of LMP1's "settling" feature mentioned in Section 4.4.
As Figure 8 demonstrates, settling occurs with surprising regularity, irrespective of eventual MTD-selection
success. The tendency to settle is driven by LMP1's estimation-based dose-allocation procedure, and its
underlying root-n self-correction rate (see Section 2.2). The likelihood calculation after cohort c + 1 is the
same as after cohort c, except for the last cohort's data. Therefore, the relative change to the likelihood
surface diminishes as the experiment progresses, except possibly for the first DLT-containing cohort. See
the right-hand sides of Figures 2-3. Since BP1s weight the likelihood with a prior, the impact of new data is
smaller than for non-Bayesian LMP1s, and settling is earlier and more pervasive. Settling behavior is likely
more pronounced with multi-parameter models (see Supplement E). CCD behaves somewhat differently with
respect to settling. Because it only estimates F at the current dose, its progression from volatility to settling
is not as smooth. In the Section 4 simulation runs, CCD must leave every dose after its first visit, because
with 2 observations there is no possible $\hat{F}$ value falling inside the interval (0.2, 0.4). If these observations are
both DLTs, the next visit is guaranteed to end in a de-escalation regardless of DLT outcome. However, the
salient LMP1 features of root-n self-correction and eventual settling are similar, and after several cohorts at
the same dose they become dominant.
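The two statements about CCD with cohorts of size 2 can be checked by enumeration. The sketch below assumes a CCD-like rule based on the cumulative DLT proportion at the current dose: escalate when it is at or below 0.2, de-escalate when it is at or above 0.4, repeat the dose otherwise. The handling of the interval endpoints and the function name are our assumptions, not taken from Ivanova et al.

```python
from fractions import Fraction

LOW, HIGH = Fraction(1, 5), Fraction(2, 5)    # tolerance interval (0.2, 0.4)


def ccd_decision(dlts: int, n: int) -> str:
    """Decision from the cumulative DLT proportion at the current dose.

    Assumption: the endpoints 0.2 and 0.4 trigger a move (open interval).
    """
    f_hat = Fraction(dlts, n)
    if f_hat <= LOW:
        return "escalate"
    if f_hat >= HIGH:
        return "de-escalate"
    return "stay"


# First visit to a dose: 2 subjects, hence 0, 1 or 2 DLTs.
first_visit = {x: ccd_decision(x, 2) for x in range(3)}

# Second visit after a 2-of-2 first visit: 4 subjects cumulatively, 2 to 4 DLTs.
second_visit_after_2_of_2 = {x: ccd_decision(2 + x, 4) for x in range(3)}

print(first_visit)
print(second_visit_after_2_of_2)
```

With 2 observations the estimate can only be 0, 1/2 or 1, so "stay" never occurs on a first visit; after 2-of-2, the cumulative proportion at the next visit is at least 2/4, which lies above the interval whatever the new outcomes are.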

One could argue that LMP1s are no different from other likelihood-driven estimation processes, which
typically slow down and self-correct at a root-n rate. However, LMP1s use the estimation process to direct
and restrict the collection of future information. Older and more established sequential designs, such as the
sequential probability ratio test (SPRT; Wald, 1945), require sufficient evidence before altering or halting
data collection. With LMP1s, on the other hand, a dose level only has to continue to be slightly more likely
than any other, in order to receive all subsequent allocations. This "winner-take-all" property is at odds with
the low precision of the $\hat{F}$'s, in particular early in the experiment.

The long memory lends the earliest cohorts (and the prior) a disproportionately large influence. Even
though in each specific likelihood calculation all observations are equally weighted, the first cohort participates
in all calculations while the last one participates only in the final estimate. Moreover, in the early
calculations each single observation is more influential, because only a few are available. From an operational
perspective, one should keep in mind that the earliest cohorts are often more likely to contain various
experimental mishaps due to inexperience.

Resche-Rigon et al.'s CRM experiment, targeting p = 0.1 with cohort size 1, was dominated by the
first cohort (Resche-Rigon et al., 2008). A DLT in cohort 1 pushed dose assignments down to
d1. Allocations remained at d1 for the remainder of the experiment, despite 0-of-10 toxicities observed.
Calculations indicated that only after 0-of-14 at d1 would escalation to d2 have finally been allowed. For
future studies, the researchers suggested imposing an ad-hoc weighting scheme on the likelihood calculations,
one that discounts the impact of observations as they recede into the experiment's "past." This can be seen as a
compromise between long and short memory, albeit with unclear properties. In a more recent Phase II dose
de-escalation efficacy study with 5 levels, also targeting a p = 0.1 failure rate, the same authors decided not to
use the past-discounting scheme they had developed (Zohar et al., 2012). Two failures with patients number
6 and 10 pushed the experiment up to the d^ — d^ range for the next 10 patients. When the 25-patient trial 
was over, estimates suggested that either d^ (6 patients treated) or d2 (only 1 patient treated) is the MED. 
Still, the authors mentioned in the conclusions that using CRM leads to higher n* values than '3+3.' 



5.2 BP1 Vulnerability: Model-Related Artifacts



The order effect in Mathew et al. (Mathew et al., 2004) was complicated by a shallow family of model
curves, producing a very volatile trajectory. All three dose-transition recommendations in that experiment
were multi-level jumps (d^ to d^, to d^, to d2). The MTD estimate after cohort 4 was identical to its
predecessor, indicating the possible onset of settling. Zohar et al. (Zohar et al., 2012) is another example of
a shallow curve producing strong early volatility followed by settling. In the first 10 dose assignments there
were 6 dose transitions, 3 of them multiple-level jumps (this experiment did not implement a dose-escalation
restriction). The remaining 15 assignments, including the final estimate, had only 2 transitions altogether.
Steep model slopes lead to stagnant trajectories, and therefore settling tends to dominate even earlier.

Neuenschwander et al. (Neuenschwander et al., 2008) attempted to work with a realistic, sigmoid-shaped
skeleton (Fig. 3, right). Furthermore, there was no adverse order effect: observations from different cohorts
were in reasonable agreement. Yet, the experiment did encounter a very real contradiction after
cohort 5. The model's "intuition", moving the MTD from d12 to d9 due to that cohort's 2-of-2 DLTs, collided
with the basic intuition of practitioners, who observed the DLTs at d7, yet were instructed by the model to
escalate to d9. Such an escalation decision was called "incoherent" by Cheung (Cheung, 2005), who proved
that it is impossible with one-parameter CRM - but only if one is already at the estimated MTD to begin
with. Since that experiment started far below the prior-predictive MTD, it was at a high risk of running into
incoherent assignments. A pre-specified escalation path that reduces or eliminates such
a risk might be found (Cheung, personal communication). However, unless one changes the field's toxicity-averse
priorities, such occurrences are probably unavoidable with BP1s. The protocol might mandate starting at a low dose,
while the design and scientific information might point towards a higher dose as the prior MTD, and the
single-dose escalation constraint limits the rate at which this gap can be closed. Neuenschwander et al. could
have reduced the risk by using fewer dose levels, starting at a somewhat higher dose, or creating a skeleton
that rises somewhat faster - but in their defense, the "non-coherence" risk is not clearly warned against in
articles and tutorials promoting the merits of BP1s (Garrett-Mayer, 2006; Rogatko et al., 2007). And even
the sigmoid shape was not flexible enough to capture the rather obvious data patterns at the experiment's 
end: the original CRM design still preferred d% over the recommended-MTD dg, because the steep increase 
in observed toxicities occurred around — d^, rather than the skeleton's dg — di2 (Fig. 21 right, curve '8'). 

Experienced BP1 designers now often prefer a specific curve shape: convex G skeletons of the type we used
above in Section 4, allowing for volatility at lower doses and more conservatism at higher doses. But unless
the target toxicity rate is very low, these skeletons are a scientifically unrealistic description of a distribution
of toxicity thresholds, because they push the steep increase (representing the bulk of the toxicity-threshold
population) to the upper edge of the dose space. In a similar vein, CRM with the relatively "uninformative"
Prior B performs more evenly in terms of average MTD-selection success, compared with the moderately-informative
Prior A (see Table 2; both priors are equally sensitive to order - see a plot analogous to Fig. 5 in
Supplement D). However, we wanted to emulate a typical Bayesian scenario, where clinicians are asked for
their scientific insight, and analysts reflect that insight in the specification. The difference in performance
between two not-so-different priors, and the experience of Neuenschwander et al., demonstrate how easy it is
to encounter undesirable behavior when attempting to incorporate scientific information into BP1 models.



Lee and Cheung (Lee and Cheung, 2009) describe the challenge of looking for operational BP1 "comfort
zones" as a time-consuming search in multidimensional space. As mentioned in the Introduction, they
offer an algorithm calculating a model skeleton using an "indifference interval" around p as input. In
subsequent work (Lee and Cheung, 2011), they develop an automatic calculation of a "least informative"
prior. These developments make CRM somewhat simpler and more transparent, at the price of lowering
expectations with respect to the method's capabilities. Nonparametric "interval designs" such as CCD offer
the same capabilities: guaranteed convergence to within a similarly-specified interval (Oron et al., 2011),
and a completely uninformative prior - in fact, no prior at all - without any need to specify skeletons and
parameter distributions, or to consult sophisticated design tools. One only needs to specify an interval
and a set of dose levels. So why bother with the additional complexity?



5.3 Recommendations and Future Directions 

The rather minor differences in average MTD-selection performance between radically different designs
(Table 1) suggest that, on this measure, all of them are not too far from the attainable maximum - given
the limited information provided by the observations. This sheds new light upon O'Quigley et al.'s study,
mentioned in the Introduction (O'Quigley et al., 2002). Doubtless, a few more percentage points can be added
via methodological improvements. But short of increasing n well beyond current conventions, or replacing
binary outcomes with something far more informative, we are destined to remain approximately in the range
of values appearing in Table 1.

This focuses our attention upon various aspects of robustness, such as run-to-run variability. The high
average values of the n* statistic when using LMP1s offer little comfort to practitioners, when accompanied
by the very high variability uncovered in our simulations. We reiterate that LMP1's variability and order
sensitivity are not Bayesian or model-related properties: the model-free CCD displayed n* variability and
sampling-order sensitivity to the same degree as the parametric CRM (Fig. 7). Supplement E presents n*
distributions from a two-parameter BP1 design. The variability is just as bad.

It is often argued that BP1s in their most prevalent form described here are not truly Bayesian, since
they use the same optimization procedure for dose allocation (where the goal is information gathering under
toxicity constraints) and for MTD selection (where the goal is estimation). However, interesting recent
attempts to modify BP1 dose-assignment rules (Ji et al., 2007a,b; Yin and Yuan, 2009) will probably not
resolve order sensitivity, unless the underlying loss function is modified to discourage a winner-take-all
solution. Bartroff and Lai (Bartroff and Lai, 2010) and Azriel et al. (Azriel et al., 2011), both writing about
"the treatment vs. experimentation dilemma", each offered a new design. We have been able to examine the
Azriel et al. design, and it does not alleviate the variability in n* (see figure in Supplement E). We suggest
that studies of future designs always include such an examination, which is rather easy to carry out.

By no means should this article be viewed as advocating a return to 3+3. While that protocol is simple
and restricts toxicities, the latter property causes it to stay mostly below the MTD. Furthermore, 3+3's
stopping dose is not a statistical estimate. If one is compelled to follow a 3+3 or similar protocol, then a
statistically appropriate post-experiment estimate should rely upon the final F values, via isotonic regression
(Stylianou and Flournoy, 2002; Oron, 2007) or a parametric (e.g., logistic) regression. Among established
design options, we feel that the best combination of performance, reliability and guaranteed properties is
currently offered by 'up-and-down' - a design family that is extensively used in science and engineering.
U&D's rapid (geometric-rate) convergence to asymptotic behavior is more compatible with Phase I's small
samples than LMP1's root-n rate. Moreover, U&D's short memory is more forgiving towards discrepancies
between early and late observations. Since U&D generates a random walk, the dose at which a U&D
experiment happened to end is not the MTD estimate. While excursions to high-toxicity regions are a
feature of random walks, with U&D the expected toxicity rate over the entire experiment is approximately
p - which is by definition a tolerated rate. Hence, communicating U&D's toxicity risk to participants is
simple and accurate. Recent discussions and implementation recommendations for U&D can be found in
Pace and Stylianou (2007) and (Oron, 2007, Ch. 2-3).

Another alternative is to build upon the attractive properties of U&D, and use LMP1's estimation
potential to restrict U&D excursions. Combining the two approaches is not a new idea: LMP1s already
utilize a U&D start-up stage, via the well-known "two-stage" approach (Storer, 1989, 2001; Iasonos et al.,
2008). The most common opening stage is a run of single-patient cohorts, escalating until the first toxicity -
identical to the beginning of the original median-targeting U&D (Dixon and Mood, 1948). While this two-stage
solution is convenient and simple, it is not a very judicious combination of the two design families, and
the risk of early settling and order sensitivity is not substantially reduced. A hybrid U&D-LMP1 approach
was developed by Narayana in the 1950s for median-targeting designs, and recently expanded by Ivanova et al.
(Ivanova et al., 2003). Another hybrid design incorporating U&D in the role of SPRT's "continue sampling"
option was presented in (Oron, 2007, Ch. 5). It succeeds in increasing n* on average compared with
U&D, while retaining low variability. Average estimation success is also improved (see Supplement E). The
"Narayana" design can be seen as a simplistic version of this approach. This is an area of ongoing research.

Supplement for Oron & Hoff, "Small-Sample Behavior..."



A. The Chevret (1993) One-Parameter CRM Model

Most current and published CRM experiments have used the "power" model described in the article's
Section 2.1. However, some studies such as Flinn et al. (2000) (used for the article's Fig. 1) and Dougherty et al.
(2000) use a more sophisticated version developed by Chevret (1993). It is sometimes misunderstood as a
one-parameter logistic model with the location parameter fixed. In fact it uses a one-parameter logistic
"skeleton", somewhat re-parametrized - and then transforms it horizontally, i.e. on the dose scale. The
skeleton parametrization is



$$ G(\xi) = \frac{1}{1 + \exp\left[\beta_0 - \theta\xi\right]} \qquad (5) $$

The data-estimable parameter $\theta$ affects both location and scale. However, the curve is logistic only when
plotted vs. the transformed doses $\xi_u$, which are related to the original dose levels $d_u$ via

$$ \phi_u = \frac{1}{1 + \exp\left[\beta_0 - \bar{\theta}\,\xi_u\right]} \qquad (6) $$

where $\phi_u$ is the initial toxicity-rate estimate at $d_u$ according to researchers' prior knowledge, as in the ordinary
"power" model, and $\bar{\theta}$ is the prior mean of $\theta$. Thus, the $\xi_u$ are found by back-calculation. Ostensibly this
allows for the same flexibility as with the "power" model while maintaining a coherent curve that is often used
in literature. However, the lateral transformation makes it hard to envision the final dose-toxicity curve.



B. Trajectories of Two Additional Experiments 



Fig. 9 shows the trajectories of two recent Japanese CRM studies. Morita et al. (2007, left) targeted p = 0.2,
obeyed the Goodman et al. (1995) constraint, and had three levels with x1 = d2. To control toxicities, most
cohorts at d3 were limited to single patients, compared with two patients per cohort at lower levels. After
no DLTs in cohort 1 the dose was escalated, with the first DLT observed in the third patient at d3 (fifth
from the start). However, CRM (which used the consensus of four "power" models with different skeletons,
perhaps a precursor to Yin and Yuan (2009)'s BMA work) still prescribed a repeat of d3, which produced
one more DLT-free patient followed by another DLT. The final six patients were treated at d2, with 2 DLTs.
The experiment ended after 13 patients due to "settling", recommending d2. Even though the observed DLT
rate eventually exceeded the target rate p on both levels 2 and 3 (2/8 and 2/5, respectively), d1 was never
allocated during the experiment. Another recent CRM study from Japan (Saji et al., 2007, right) is shown
here not for its model properties, but because the sequential allocation to all of this trial's 6 cohorts would
have been completely identical, had the researchers used the '3+3' protocol instead.

[Figure 9 panels: "Morita et al. (2007) Experiment Trajectory" and "Saji et al. (2008) Experiment Trajectory".]

Figure 9: Trajectories of the Morita et al. (2007) and Saji et al. (2007) experiments, using the conventions
from Figure 1 of the main article.



C. Similarities and Differences between '3+3' and Up-and-Down 

Even though we have not found definite historical proof, it is quite likely that the '3+3' protocol was inspired by
group Up-and-Down (GU&D) designs. These designs have been in use since the 1960s (Tsutakawa, 1967),
and this is also the time frame when '3+3' begins to make its appearance. In GU&D designs, a cohort
of k subjects is treated simultaneously. If b or more toxicities are observed, treatment descends one level
down. If a or fewer are observed, it escalates one level up. Otherwise, the next cohort receives the same
treatment. Obviously, 0 ≤ a < b ≤ k. Choice of stopping rules and estimation method is left up to
the researchers' (and consultants') discretion. Gezmu and Flournoy (2006) introduce the useful shorthand
terminology GU&D(k,a,b) to describe any design of this family.

The version of '3+3' most commonly quoted nowadays runs as follows (Rosenberger and Haines, 2002):



1. Start at the lowest (or sometimes second-lowest) level.

2. Treat cohorts of 3 subjects at a time. 

3. If this is the first cohort at the present level do as follows: if no toxicities are observed, escalate; if 2 
or 3 are observed, descend; if 1, treat another cohort at the same level. 

4. If this is the second cohort at the present level, consider all 6 subjects. If 2 or more toxicities were 
observed, descend. Otherwise escalate. 

5. If a third cohort is mandated for any given level, then the experiment stops. 

6. The MTD estimate is the highest level such that $\hat{F} < 1/3$.

Some variants use even more aggressive stopping rules, such as stopping the experiment after encountering
a level with 2 toxicities out of 6 and declaring the next-lowest level the MTD, or (similarly) declaring a level
with 1 of 6 to be the MTD.
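The numbered rules above can be written as a short trajectory function. This is a sketch under stated assumptions rather than a reference implementation: levels are indexed from 0, toxicity counts at a level accumulate across separate visits to it, and a run also stops if a transition would leave the dose range (the rules above do not cover that case). The function names are hypothetical.

```python
from fractions import Fraction
from typing import Dict, List, Optional


def decide(tox_at_level: List[int]) -> str:
    """Steps 3-4: decision after the first or second cohort of 3 at a level."""
    if len(tox_at_level) == 1:
        t = tox_at_level[0]
        return "escalate" if t == 0 else ("stay" if t == 1 else "descend")
    # Second cohort: consider all 6 subjects treated at this level.
    return "descend" if sum(tox_at_level) >= 2 else "escalate"


def run_three_plus_three(tox_counts: List[int], n_levels: int, start: int = 0):
    """Feed per-cohort toxicity counts (0-3) through the rules, in order."""
    history: Dict[int, List[int]] = {u: [] for u in range(n_levels)}
    level, path = start, []
    for t in tox_counts:
        if len(history[level]) == 2:       # step 5: a third cohort is mandated
            break
        history[level].append(t)
        path.append(level)
        level += {"escalate": 1, "stay": 0, "descend": -1}[decide(history[level])]
        if not 0 <= level < n_levels:      # assumption: stop at the range edges
            break
    return path, history


def mtd_estimate(history: Dict[int, List[int]]) -> Optional[int]:
    """Step 6: highest visited level with observed toxicity rate below 1/3."""
    ok = [u for u, tox in history.items()
          if tox and Fraction(sum(tox), 3 * len(tox)) < Fraction(1, 3)]
    return max(ok) if ok else None


path, history = run_three_plus_three([0, 0, 1, 0, 2, 0], n_levels=5)
print(path, mtd_estimate(history))
```

In this made-up run the experiment stops when a third cohort would be required at the third level (index 2), which is also the resulting MTD estimate.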

The beginning of a '3+3' experiment looks just like a GU&D(3,0,2) (which, by the way, targets Q_0.347).
However, upon visiting the same level a second time, the next transition decision is changed to something
like a GU&D(6,0,2) transition (targeting Q_0.181) - with the important distinction that in a GU&D experiment
the decision is based only upon the current cohort and not upon less recent ones (a genuine GU&D(6,0,2)
would treat all 6 subjects at once). Furthermore, each time the experiment visits a new level, decision rules
revert to the GU&D(3,0,2)-like stage.
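The percentiles quoted above can be reproduced numerically. The sketch below assumes that a GU&D(k,a,b) design targets the toxicity rate F at which escalation and de-escalation are equally likely, i.e. P(X ≤ a) = P(X ≥ b) for X ~ Binomial(k, F); this balance-point definition is our assumption about how such targets are obtained, and the function names are hypothetical.

```python
from math import comb


def binom_cdf(x: int, k: int, f: float) -> float:
    """P(X <= x) for X ~ Binomial(k, f)."""
    return sum(comb(k, j) * f**j * (1 - f) ** (k - j) for j in range(x + 1))


def gud_balance_point(k: int, a: int, b: int, tol: float = 1e-10) -> float:
    """Toxicity rate at which P(escalate) equals P(de-escalate) for GU&D(k,a,b)."""

    def drift(f: float) -> float:
        p_up = binom_cdf(a, k, f)             # a or fewer toxicities: escalate
        p_down = 1 - binom_cdf(b - 1, k, f)   # b or more toxicities: de-escalate
        return p_up - p_down

    lo, hi = 0.0, 1.0                         # drift(0) = 1 and drift(1) = -1
    while hi - lo > tol:                      # bisection; drift decreases in f
        mid = (lo + hi) / 2
        if drift(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(round(gud_balance_point(3, 0, 2), 3))
print(round(gud_balance_point(6, 0, 2), 3))
```

Under this definition the two balance points fall close to the values quoted in the text for GU&D(3,0,2) and GU&D(6,0,2), which illustrates the third difference listed below: the two '3+3' transition rules, taken separately, aim at different percentiles.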

In summary, these are the major differences between '3+3' and the U&D family: 

• First, '3+3' switches mid-experiment back and forth between 1-cohort and 2-cohort transition rules. 

• Second, the 2-cohort rule does not necessarily involve the 2 most recent cohorts. 

• Third, the two rules (when used each exclusively) target different percentiles. 

• Even more importantly, '3+3' has aggressive stopping rules prohibiting the administration of any single 
dose to more than 2 cohorts; U&D designs have no such constraint. 

• These differences combine to spoil random-walk properties. Unlike U&D, one cannot describe the
trajectory of a '3+3' as a simple Markovian random walk with tractable asymptotic behavior (even
though '3+3' is still a stochastic design, inasmuch as it is a design).

• Last but not least, the '3+3' MTD estimate is usually the stopping dose or the one below it. With
U&D designs, the estimate is not related to the last administered dose, but is instead calculated using
information gathered from all the experiment's trials, via some averaging scheme or isotonic-regression
interpolation (Stylianou and Flournoy, 2002; Oron, 2007).



D. Supplementary Simulation Information 

Model Curves and Simulated Curves 



The simulation results presented in the article follow the format used in lOronI (|2007l Ch. 4) . Rather than 
use arbitrarily chosen, rounded toxicity values at the dose levels (as is often done in BPl simulation), we 
preferred to simulate F using standard distributions, which approximate scenarios that can be realistically 
encountered in practice, and which are commonly used to model dose-response dependence. Curve families 
used include Logistic, Normal, Gamma, WcibuU, Lognormal and uniform. Dose levels were always uniformly 
spaced. The CRM model details were provided in the article body. 

[Figure 10 panels, left to right and top to bottom: Uniform, Gamma, Normal, Lognormal, Weibull, Logistic. 
Horizontal axis in each panel: Dose Level (1 through 6).] 

Figure 10: Toxicity curves (solid lines) for the six scenarios. The dashed lines show the CRM 'power' model 
curves that match each toxicity curve exactly at the MTD. MTDs are levels 1 through 6, increasing from left 
to right and top to bottom. The target toxicity rate itself (p = 0.3) is indicated via horizontal dotted lines. 

Figure 10 shows toxicity curves from the simulation setup having I = 6 uniformly-spaced levels, and 16 
cohorts of k = 2 subjects each per run. The figure also shows the model curve that matches F exactly at 
the MTD. We chose 6 scenarios that are realistic, sufficiently different from each other, present different 
levels of challenge to the CRM designs, and have different levels as MTDs. As specified in the article, the 
one-parameter "skeleton" used was φ = (0.05, 0.11, 0.22, 0.40, 0.60, 0.78), which is equivalent to a logistic 
curve with location parameter μ = 0.75 and scale parameter σ = 0.2, if we assign to the dose levels the 
evenly spaced numerical values {1/6, 1/3, . . . , 1}. This skeleton closely resembles the convex ones 
preferred by many CRM researchers. 

The prior distribution on the single data-estimated 'power' parameter was Lognormal. Prior A (the main 
one used to produce all figures) had the Lognormal parameters μ = −0.2, σ = 0.85. Prior B, which placed 
more weight on higher levels and pushed the runs more aggressively upward, used μ = 0.0, σ = √1.34. This, 
by the way, is the "default" prior in the crm R function by Cheung. It is preferred by researchers who use 
convex skeletons, because on these skeletons it tends to produce a nearly-uniform predictive prior for the 
MTD. Prior C reflects a strong belief that the MTD is not at the higher doses. It uses μ = −0.5, σ = 0.6. 
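
The skeleton's equivalence to a logistic curve, and the way a Lognormal prior on the 'power' parameter translates into a predictive prior for the MTD, can be checked numerically. The following Python sketch is an illustration only, not the code used for the simulations. It assumes the power model takes the form φ_j^θ with θ drawn from the Lognormal prior, and that the model's MTD is the level whose model toxicity is closest to p = 0.3. The function names, the Monte Carlo sample size and the seed are hypothetical choices made for the example.

```python
import math
import random

SKELETON = [0.05, 0.11, 0.22, 0.40, 0.60, 0.78]
DOSES = [j / 6 for j in range(1, 7)]  # evenly spaced values 1/6, 1/3, ..., 1
TARGET = 0.3                          # target toxicity rate p


def logistic_cdf(x, loc=0.75, scale=0.2):
    """Logistic curve with the location and scale quoted in the text."""
    return 1.0 / (1.0 + math.exp(-(x - loc) / scale))


def prior_predictive_mtd(mu, sigma, n_draws=20000, seed=1):
    """Monte Carlo estimate of the prior-predictive MTD distribution.

    Assumption: the model curve is SKELETON[j] ** theta, with
    theta ~ Lognormal(mu, sigma), and the MTD is the level whose
    model toxicity is closest to TARGET.
    """
    rng = random.Random(seed)
    counts = [0] * len(SKELETON)
    for _ in range(n_draws):
        theta = rng.lognormvariate(mu, sigma)
        curve = [phi ** theta for phi in SKELETON]
        mtd = min(range(len(curve)), key=lambda j: abs(curve[j] - TARGET))
        counts[mtd] += 1
    return [c / n_draws for c in counts]


# Logistic values at the six dose levels, rounded to two decimals.
logistic_values = [round(logistic_cdf(x), 2) for x in DOSES]
print(logistic_values)

# Predictive priors for the MTD under Prior A and Prior C.
prior_a = prior_predictive_mtd(-0.2, 0.85)
prior_c = prior_predictive_mtd(-0.5, 0.6)
print([round(q, 3) for q in prior_a])
print([round(q, 3) for q in prior_c])
```

Under these assumptions the rounded logistic values reproduce the skeleton, and Prior C places less predictive weight on the highest level than Prior A does.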

As can be seen, the Uniform and Normal scenarios are matched very closely by the model, which indeed 
meets the restrictive Shen and O'Quigley (1996) convergence criteria for these scenarios. The Gamma and 
Lognormal scenarios only meet the relaxed criteria suggested by Cheung and Chappell (2002), indicating 
slower convergence; while the Weibull and Logistic scenarios are not guaranteed to converge to the MTD. 



E. Run-to-Run Variability of Some Additional Models 

The article makes heavy use of one-parameter CRM experiments and examples, because CRM is by far 
the best-known and most widely implemented LMP1 design. However, the order sensitivity and n* variability 
are universal LMP1 features. This is exemplified in the article via CCD examples. Figure 11 shows examples 
from a simulation that included a two-parameter BP1 using a location-scale logistic model. That simulation 
had 500-run ensembles with a cohort size of 1. There were I = 8 dose levels, and the U&D design used was 
"k-in-a-row" (Gezmu, 1996), a design with single-patient cohorts that was proven by Oron and Hoff (2009) 
to converge faster than the group U&D design used in the main article. 

[Figure 11 panels: 2-Par. BP1 (left) and U&D (right), each under "Another Normal Scenario" and "Another 
Gamma Scenario". Horizontal axis in each panel: Number of Patients Allocated to MTD.] 

Figure 11: A plot analogous to the main article's Figure 5 (distribution of n*), but from a different simulation 
run that included a 2-parameter logistic BP1 (left) and a k-in-a-row U&D design (right). Simulation details 
are in the text. 

Shown are the n* distributions of the first 20 allocations of 2-parameter BP1 (left) and U&D (right), 
under two scenarios. These are not the same "Normal" and "Gamma" scenarios as in the main article's 
Section 4. Both MTDs are fairly easy to detect in terms of dose spacing. However, the "Gamma" MTD's 
2-parameter prior-predictive weight was smaller than that of the "Normal" MTD. As suggested in the article 
(Section 5.1), the more parameters in the curve, the more influential the prior might become. Thus, the 
2-parameter runs settle earlier than 1-parameter CRM runs (5-cohort settling observed by the 10th allocation 
in 45% of Normal runs and 42% of Gamma runs, compared with 33% and 32%, respectively, with one-parameter 
CRM). In any case, the two-parameter BP1 variability in n* is strikingly similar to that observed with 
one-parameter designs, while U&D is robust by comparison, both between runs and between scenarios. 

Figure 12 returns to the simulation design used in the article. On the left is the same one-parameter 
"power" model from the article, but with the "default prior" (Prior B). The variability in n* is practically 
identical to that observed in the article's Figure 5. 



[Figure 12 panels: CRM with Prior B (left) and Azriel et al.'s RAD (right), each under the Normal, Gamma 
and Lognormal scenarios. Horizontal axis in each panel: Number of Cohorts Allocated to MTD.] 

Figure 12: A plot identical to the main article's Figure 5 (distribution of n*), except that here the two 
compared designs are CRM with Prior B, the "default" prior (left), and Azriel et al. (2011)'s RAD (right). 

[Figure 13 panels: Oron's Hybrid Design, CRM+U&D (left) and CCD+U&D (right), each under the Normal, 
Gamma and Lognormal scenarios. Horizontal axis in each panel: Number of Cohorts Allocated to MTD.] 

Figure 13: A plot identical to the main article's Figure 5 (distribution of n*), but with Oron (2007)'s hybrid 
design, combining U&D with CRM (left) and CCD (right). 

The right-hand side of Figure 12 shows the n* distributions on the same scenarios, for Azriel et al. (2011)'s 
Random Allocation Design (RAD). RAD takes a nonparametric LMP1 known as "isotonic regression design" 
(Leung and Wang, 2001), and adds randomization. The original design is closely related to CRM, allocating 
to the level whose isotonic-regression F estimate is closest to p. Under RAD, the next cohort might be 
assigned instead to the dose on the opposite side of target, according to a random draw whose probability is 
inverse to n. While the "isotonic regression design" does not converge, Azriel et al. (2011) proved that RAD 
does converge in probability to the MTD (but not almost surely). However, as Figure 12 (right) shows, RAD 
behaves very poorly in terms of n* variability. Its average estimation performance for small samples is also 
unimpressive (data not shown). This suggests that while a simple weakening of the "winner-take-all" rule 
can lead to better convergence, a more careful modification than blind randomization is needed in order to 
improve small-sample behavior. 
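
The allocation rule of the original "isotonic regression design" described above can be sketched as follows. This is a minimal Python illustration, not the implementation of Leung and Wang or of Azriel et al. The function names (pava, next_level), the restriction to levels that already have observations, and the tie-breaking toward the lower level are assumptions made for the example. RAD's randomization step is left out, because the exact form of its draw probability is only summarized in the text.

```python
def pava(rates, weights):
    """Weighted pool-adjacent-violators fit: the nondecreasing sequence
    closest to `rates` in weighted least squares."""
    blocks = []  # each block holds [pooled mean, total weight, levels pooled]
    for rate, weight in zip(rates, weights):
        blocks.append([rate, weight, 1])
        # Pool backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit


def next_level(tox_counts, n_treated, target=0.3):
    """Index of the level whose isotonic toxicity estimate is closest to
    the target. Assumes every listed level has at least one observation;
    ties go to the lower level."""
    rates = [t / n for t, n in zip(tox_counts, n_treated)]
    fit = pava(rates, n_treated)
    return min(range(len(fit)), key=lambda j: abs(fit[j] - target))


# Raw rates 0.5, 0.0, 0.5 violate monotonicity; the first two are pooled.
print(pava([0.5, 0.0, 0.5], [2, 2, 4]))

# Raw rates 0, 0.25, 0.5, 1 are already monotone; 0.25 is nearest to 0.3.
print(next_level([0, 1, 1, 2], [2, 4, 2, 2]))
```

Because each decision picks the single closest level, this is a "winner-take-all" rule of the kind discussed above.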



One such modification (Oron, 2007, Ch. 5) combines U&D with an LMP1 design. Unlike RAD's randomization, 
which inevitably leads to "non-coherent" assignment decisions (i.e., escalation following DLTs 
and vice versa), here the non-LMP1 rule is U&D, which is in fact the design's default. LMP1 can override 
the U&D assignment only if the override passes a test of confidence. For example, if CRM indicates staying 
at d_u, while U&D indicates escalation to d_{u+1}, the experiment will escalate unless the combined 
MTD-predictive-posterior weight of all levels above d_u is less than a fixed confidence threshold β: 0 < β < 0.5. 

Table 3: Bulk performance of two hybrid designs (U&D+CRM and U&D+CCD), compared with the best 
of the main article's Table 1 (Section 4.3). For each of six scenarios, compared are the proportion of runs 
(in percent) in which the correct MTD was selected, after 8 (left) and 16 (right) cohorts, respectively. The 
hybrid design is estimated using centered isotonic regression. 

                     After 8 Cohorts                     After 16 Cohorts 
Scenario       U&D+CRM  U&D+CCD  Tbl. 1 Best     U&D+CRM  U&D+CCD  Tbl. 1 Best 
"Uniform"        53.6     55.7      57.1           64.7     63.0      64.1 
"Gamma"          43.8     41.8      44.2           51.5     51.5      53.2 
"Normal"         56.9     55.5      57.8           68.6     67.3      67.5 
"Lognormal"      44.5     36.1      46.7           55.5     48.5      59.3 
"Weibull"        36.0     35.1      39.0           46.1     46.3      47.2 
"Logistic"       23.7     25.4      32.2           34.9     51.0      54.6 



For non-Bayesian designs, probability calculations or p-values are used instead of the posterior. Generally, 
early in the experiment the U&D rule will be used exclusively, and gradually more LMP1 decisions will be 
accepted. Lower values of β are conservative, while values close to 0.5 are aggressive. 
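
The override test in the example above (CRM indicates staying, U&D indicates escalation) can be written as a short decision function. The Python sketch below covers only that one case. The function name, the argument layout and the representation of the MTD posterior as a list of per-level weights are hypothetical choices for illustration, not taken from Oron (2007).

```python
def hybrid_next_level(current, mtd_posterior, beta):
    """Resolve the case where U&D proposes escalating from `current`
    and CRM proposes staying.

    `mtd_posterior[j]` is the posterior weight that level j is the MTD
    (an assumed representation). The U&D default (escalate) holds unless
    the combined weight of all levels above `current` is below `beta`.
    """
    if not 0.0 < beta < 0.5:
        raise ValueError("beta must lie strictly between 0 and 0.5")
    weight_above = sum(mtd_posterior[current + 1:])
    if weight_above < beta:
        return current       # override passes the confidence test: stay
    return current + 1       # U&D default stands: escalate


# Weight above level 1 is 0.4, not below beta = 0.25, so U&D escalates.
print(hybrid_next_level(1, [0.1, 0.5, 0.3, 0.1], 0.25))

# Weight above level 1 is 0.2, below beta = 0.25, so the CRM "stay" wins.
print(hybrid_next_level(1, [0.2, 0.6, 0.15, 0.05], 0.25))
```

With little data the posterior weight above the current level is rarely small, so under this sketch the U&D default tends to stand early on, in line with the description above.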

Figure 13 and Table 3 show results of a U&D-CRM combination with β = 0.25, and a U&D-CCD 
combination with β = 0.35. The overall distribution of n* using this design remains similar to U&D's (compare 
with Fig. 5, main article), but its center is shifted some 1-2 cohorts to the right. Average performance, 
especially after 32 patients, is also somewhat improved and less variable between scenarios, compared with 
CRM, CCD or U&D alone. 